Text differentiation methods, systems, and computer program products for content analysis

ABSTRACT

Provided are improved methods, apparatus, and computer program products for text differentiation, which involves identifying differences between documents with similar content, not merely similar terms, and generating results. Text differentiation provides the ability to find non-similar, or different, content hidden within documents with similar overall content, but not exactly the same content. Text differentiation may be used to quickly identify key differences between similar documents.

CROSS-REFERENCE TO RELATED APPLICATIONS

The contents of U.S. Pat. No. 6,611,825, entitled Method and System for Text Mining using Multidimensional Subspaces, and U.S. Pat. No. 6,701,305, entitled Methods, Apparatus and Computer Program Products for Information Retrieval and Document Classification Utilizing Multidimensional Subspace, are incorporated by reference.

FIELD OF THE INVENTION

The present invention relates generally to text data analysis and, more particularly, to identifying non-similar content between documents with similar content.

BACKGROUND

Data mining broadly seeks to expose patterns and trends in data, and most data mining techniques are sophisticated methods for analyzing relationships among highly formatted data, i.e., numerical data or data with a relatively small fixed number of possible values. However, much of the knowledge associated with an enterprise consists of textually-expressed information, including databases, reports, memos, e-mail, web sites, and external news articles used by managers, market analysts, and researchers.

In comparison to data mining, text data analysis (also sometimes called text mining or text analysis) refers to the analysis of text and may involve such functions as text summarization, information visualization, document classification and clustering (e.g., routing and filtering), document summarization, and document cross-referencing. Text data analysis may help a knowledge worker find relationships between individual unstructured or semi-structured text documents and semantic patterns across large collections of such documents. For example, U.S. Pat. Nos. 6,611,825 and 6,701,305 describe particular text data analysis methods. Text data analysis sometimes is a supporting aspect of data mining, but the concepts can be used independently as separate information retrieval methods, or together, such as to provide a data mining application that incorporates the ability to analyze text data.

Once a suitable set of documents and terms has been defined for a document text collection, various document retrieval techniques can be applied to the collection, such as keyword search methods, natural language understanding methods, probabilistic methods, and vector space methods.

Results of document retrieval techniques are typically presented as lists of documents, typically related to search terms. Often the list of related documents is sorted by relevancy to search terms and provides linked URL references for the knowledge worker to explore the full document. Lists of related documents also are often supplemented by extracts of the documents containing “hits” of the search terms (text summarization), helping the knowledge worker identify the context and uses of the search terms in the documents. However, a knowledge worker presented with a list of related documents from a document retrieval application only has the benefit of any relevancy ordering or, if a document retrieval application is supplemented with text data analysis, a knowledge worker only has the additional benefit of such supplemented techniques as content ordering, text summarization, or classification for the document list and extracts from the documents to determine which documents to explore. Unfortunately, this type of typical searching process only provides a knowledge worker with limited information, and sometimes misleading information. For example, documents that include one or more high frequency terms may receive a misleadingly good relevancy score and be elevated in the result list even though those documents include few, if any, of the other terms of the query. Many variations on this general searching process have been proposed or developed, such as weighting various terms and reducing the impact of high-frequency terms. Regardless of the improvements of the algorithms or presentation features, the knowledge worker remains limited by underlying algorithms and, particularly, the presentation of the document results, typically a list of URL references with exemplary document extracts.
Similarly, results of text data analysis are often presented simply by versioning control that identifies editing and other differences between two documents and does little to help a knowledge worker analyze the content of one or more documents.

SUMMARY OF THE INVENTION

Embodiments of the present invention for improved text data analysis generally may be used to supplement conventional text data analysis, data mining, and document retrieval techniques and applications. For example, when a conventional document retrieval technique results in a number of documents which have similar content, an embodiment of the present invention of an improved text data analysis method, system, or computer program product may be used to further understand the relationships between these documents. An improved text data analysis method of an embodiment of the present invention identifies differences between documents with similar content, not merely similar terms, and generates results for presentation. Such an improved text data analysis method and its results would assist a knowledge worker in determining which documents may include content that distinguishes the different documents from the other documents with overall similar content, but not exactly the same content. Such an improved text data analysis method could also be used to further support and refine a data mining application, but may also be used independently and with other information retrieval applications.

The present invention provides improved methods, apparatus, and computer program products for identifying content from text data, such as from two or more documents or sections of documents or from a plurality of text documents (also referred to as a text data collection). Text differentiation is performed by analyzing documents with similar content to identify non-similar content (i.e., content differences or differentiated content) in the documents. Generally a limited set of two or more documents is analyzed to identify non-similar content, or parts of content. Documents can be from any source, including a sequence of news stories updating a particular topic, multiple news stories on a particular topic from numerous sources, a document cluster, or search results from a search engine or database query. The origin of the limited set of documents may vary depending on an application of an embodiment of the present invention. For example, in a further aspect of the invention, text differentiation is performed by identifying non-similar content in text of documents from multiple news stories on a particular topic from numerous sources. Typical news stories about a particular event describe the same or similar facts related to the event in different ways. Many news stories are re-tellings of original news stories about the event. Accordingly, these news stories will be very similar documents with similar content. However, the news stories are likely not identical, and some of the news stories may include content different from the other news stories, such as factual discrepancies or additional information. In a further aspect of the invention, text differentiation is performed by identifying non-similar content in text of documents identified using a query. Typical search results identify documents related to search terms, many of which often include similar content.
However, these documents are not identical, but may include different emphases, factual discrepancies, different subsets of information, etc. The content differences between the news stories or documents can be as important or more important to a knowledge worker than the content similarities. Finding information common to many documents is relatively simple. Identifying content differences between two or more documents has traditionally been, and increasingly continues to be, a difficult task, particularly to perform manually. Embodiments of the present invention, however, provide the ability to find content hidden within two or more documents with similar content. For example, a particular query may return one hundred documents with high relevance to the search terms; but a knowledge worker may be most interested in those documents that include content in addition to or different from certain common information shared by most or all of the hundred documents. Or a pre-selected, limited set of five documents may have very similar content with important differences hidden in one or more of the documents. A knowledge worker may want to know about different keywords, entities (e.g., personal, geographic, company names, governments, organizations, etc.), or subject matter (e.g., section or paragraph topics) included in portions of one or more documents, but not included in the majority of documents. A knowledge worker may use text differentiation to quickly identify key content differences between similar documents, and avoid spending time reviewing overlapping content in the similar documents.

According to one aspect, the method identifies non-similar content from documents that contain similar content. Related documents are analyzed not for the content they share but instead for the non-similar content that one or more documents adds to, subtracts from (lacks in), or is different from (contradicts) the base of common information. Results of a text differentiation operation may be generated in a manner that represents the non-similar content. The identification of non-similar content may include finding paragraphs of similar content; determining content differences between the paragraphs, such as using an ontology and an entity or keyword (or topic word) extraction and/or subject matter identification mode; and marking the content differences. Alternatively, or in addition, identification of non-similar content may involve determining absolutely unique and/or non-universal content.

The limited set of documents that is analyzed for non-similar content may be obtained using a document selection mode such as manually identifying two or more documents or using a set of documents from search results, which may be further refined to reduce the set of documents to a limited set of documents smaller than the search engine results. The document selection mode may be based upon a query that includes one or more search terms, a “query by example” input to allow the user to enter or provide an example document or section or paragraph of a document, or a “more like this” selection to allow the user to refine the query to a particular document or section or paragraph of a document. For example, a plurality of text documents may be analyzed for documents related to search terms of a document selection query, and two or more related documents may be extracted for further analysis. The extraction of related documents may be based on a predetermined threshold relevancy limit, such that only documents above the threshold relevancy limit are extracted. Alternatively, or in addition, the extraction of related documents may be limited to a predetermined number of documents with the highest relevancy scores with respect to the search terms. A user submitting a document selection query may be capable of setting or selecting a predetermined threshold relevancy limit or predetermined number of documents to be compared for non-similar content.
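
The two extraction limits described above, a threshold relevancy limit and a predetermined number of documents, can be illustrated with a minimal sketch. The function name `select_documents`, its parameters, and the sample relevancy scores are hypothetical and chosen only for illustration; the specification does not prescribe how relevancy scores are computed or how the limits are applied.

```python
from typing import Dict, List, Optional


def select_documents(
    relevancy: Dict[str, float],
    threshold: Optional[float] = None,
    max_documents: Optional[int] = None,
) -> List[str]:
    """Return document ids ordered by descending relevancy score.

    threshold: if given, keep only documents scoring strictly above it.
    max_documents: if given, keep at most this many top-scoring documents.
    Both limits are hypothetical, user-selectable parameters.
    """
    ranked = sorted(relevancy.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(doc, score) for doc, score in ranked if score > threshold]
    if max_documents is not None:
        ranked = ranked[:max_documents]
    return [doc for doc, _ in ranked]


# Hypothetical relevancy scores for four documents.
scores = {"doc_a": 0.91, "doc_b": 0.42, "doc_c": 0.87, "doc_d": 0.78}
selected = select_documents(scores, threshold=0.5, max_documents=2)
```

Applying the threshold before the count limit means the count limit can only shrink the set further, so both limits may be used together or independently.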

Results of a method, apparatus, or computer program product of an embodiment of the present invention need not be a visual display, but may simply be adding or editing database fields related to one or more of the analyzed documents or modification of or creation of a result document such as an edited XML document with metadata representing results of a text differentiation operation. Alternatively, or in addition, an embodiment of the present invention may provide a presentation of text data analysis (often a visual display or depiction when presented to a knowledge worker), comparing two or more documents or sections or paragraphs of two or more documents, such as using highlighting to identify content in a document or section or paragraph of a document not present in the other document or documents in the text data analysis. Further, for example, a presentation of results may include a list with links to the related documents including non-similar content. The document links may be listed above or next to abstracts or summarizations of the non-similar content. Similarly, this aspect of the present invention also may provide extraction of a subset of related documents of similar content such that the identification of non-similar content may occur independently within each subset of documents and the results are grouped by these subsets of documents of similar content.

These characteristics, as well as additional details, of the present invention are further described herein with reference to these and other embodiments.

BRIEF DESCRIPTION OF THE DRAWING(S)

Having thus described the invention in general terms, reference will now be made to the accompanying drawings.

FIG. 1 is a flow diagram illustrating logic of an embodiment for performing text differentiation of the present invention.

FIGS. 2A, 2B, 2C, 2D, and 2E are a presentation of a text data analysis and presentation application of an embodiment of the present invention.

FIG. 3 is a flow diagram illustrating logic of an embodiment for performing text differentiation of the present invention.

FIG. 4 is a block diagram of a general purpose computer system suitable for implementing an embodiment of the present invention.

DETAILED DESCRIPTION

The present invention will be described more fully with reference to the accompanying drawings. Some, but not all, embodiments of the invention are shown. The invention may be embodied in many different forms and should not be construed as limited to the described embodiments. Like numbers and variables refer to like elements and parameters throughout the drawings.

Although the second example embodiment of the present invention is described with reference to a search engine application and results thereof, embodiments of the present invention are not the same as and do not require search engine (document retrieval/query) operations, but may be used for receiving and analyzing two or more documents from any source, such as described with reference to the first example embodiment of the present invention. For example, documents may be identified and/or provided from search engine results, database search results, clustering of documents of similar type, a sequence of news stories providing updates on a particular topic, or multiple news stories on the same topic from different news sources. Accordingly, the present invention is not limited by or applicable only to document retrieval or data mining applications, but may be used alone as a text data analysis application, or combined with various other applications, certainly including, but not limited to, document retrieval and data mining applications.

“Non-similar content” may be defined as content differences representing content unique to a single document in all the searched documents, a single document in a reduced set of the searched documents, or a limited number of documents in a plurality of documents. For example, if a search of one hundred documents results in twenty relevant documents that can be divided into five subsets of related documents of similar content, non-similar content may be (i) content unique to a single document in the one hundred documents, (ii) content unique to a limited number of documents in the one hundred documents, (iii) content unique to a single document in the twenty relevant documents, (iv) content unique to a limited number of documents in the twenty relevant documents, (v) content unique to a single document in one of the five subsets of related documents of similar content, or (vi) content unique to a limited number of documents in one of the five subsets of related documents of similar content. Non-similar content refers to content differences, and is not the same as versioning control that identifies editing and other differences that are not related to the content of the document.
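
As a minimal sketch of the definition above, the following code separates content held by a single document from content held by only a limited number of documents within whichever set is being compared. It assumes each document has already been reduced to a set of content items (for example, extracted keywords or entities); the function name `classify_content`, the `limited_count` parameter, and the sample items are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


def classify_content(
    documents: Dict[str, Set[str]], limited_count: int = 2
) -> Dict[str, Dict[str, Set[str]]]:
    """Map each non-similar content item to the documents that contain it.

    "single": items found in exactly one document of the set.
    "limited": items found in more than one but at most limited_count
    documents, and not in every document of the set.
    """
    holders: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, items in documents.items():
        for item in items:
            holders[item].add(doc_id)

    result: Dict[str, Dict[str, Set[str]]] = {"single": {}, "limited": {}}
    for item, doc_ids in holders.items():
        if len(doc_ids) == 1:
            result["single"][item] = doc_ids
        elif len(doc_ids) <= limited_count and len(doc_ids) < len(documents):
            result["limited"][item] = doc_ids
    return result


# Hypothetical content items for four stories on the same event.
stories = {
    "story_1": {"partnership announced", "joint venture"},
    "story_2": {"partnership announced", "joint venture", "contract value"},
    "story_3": {"partnership announced", "joint venture", "contract value", "new office"},
    "story_4": {"partnership announced", "joint venture"},
}
differences = classify_content(stories)
```

The same function can be applied to all searched documents, to the relevant documents only, or to one subset of similar documents, which corresponds to cases (i) through (vi) above depending on the set passed in.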

The use of the term “document” is inclusive of merely a portion of a document, such as a section or a paragraph of a document. Use of both terms document and section of a document together is not meant to distinguish between an entire document and a section of a document but to emphasize, where potentially less apparent, that less than a whole document may apply and is expressly included, even though already included through use of the term document. In addition, the term “document” also encompasses text generated from images and graphics or text generated from audio and video objects, or other multimedia objects.

As mentioned, embodiments of the present invention are further described with reference to content searches on the Internet, content searches of corporate, organization, or governmental databases, and content searches of other types of document repositories. For example, embodiments of the present invention may be used to compare numerous similar documents returned from searches on the Internet. Text differentiation of the present invention identifies the ways in which documents are different in content and can be used for any task that involves comparing two or more documents with similar content where the content differences between the documents are of interest, such as tracking new developments in ongoing news stories, although text differentiation can be used in any application, including, but not limited to, intelligence, marketing, data management, and research. Similarity in content refers to commonalities in subject matter, topics, and/or events, not merely commonalities in similar terms. For example, two documents that both include the terms “2005,” “Saturn,” and “project” may not be similar in content because one document refers to a 2005 project related to the planet Saturn and the other document may be a web blog of a child talking about receiving a 2005 Saturn for his or her sixteenth birthday and a project at school. Similarity in content refers instead to documents on the same subject matter, topic(s), and/or event(s), which will typically also include commonalities in terms as a consequence of being similar in content.

The methods, apparatus, and computer program products of the present invention perform text differentiation operations and, more particularly, the identification of non-similar content from documents of similar content within the plurality of documents. In performing these operations, the methods, apparatus, and computer program products of the present invention are capable of using one or more data mining processes to support the analysis and extraction of documents, identification of non-similar content, and presentation of results. For example, an embodiment of the present invention may use the text representation using subspace transformation data mining processes of U.S. Pat. Nos. 6,611,825 and 6,701,305 for identifying non-similar content in the extracted documents with similar content. Accordingly, by using one or more data mining processes, the methods, apparatus, and computer program products of the present invention are capable of processing a large data set without requiring prior knowledge of the data, thereby identifying non-similar content in documents.

FIG. 1 is a flow diagram illustrating logic for performing text differentiation of an embodiment of the present invention. The logic begins by identifying a limited set of documents with similar content at block 6. The limited set of documents can be from any source, including search results from a search engine or database query, a document cluster, a sequence of news stories updating a particular topic, or multiple news stories on a particular topic from numerous sources. The origin of the limited set of documents may vary depending on the application of an embodiment of the present invention. For example, the flow diagram of FIG. 3 is representative of an embodiment of the present invention in a search engine application for identifying non-similar content in text of documents related to search terms of a query.

Once a limited set of documents is identified, text differentiation involves receiving the documents, as shown at block 12, to permit a system, apparatus, or method of an embodiment of the present invention to analyze the documents for information of non-similar content as shown at block 14. One or more text data analysis and/or data mining processes 10 may be used at block 14 to perform analysis of the limited set of documents. For example, a text data analysis process or a data mining process, such as described in U.S. Pat. Nos. 6,611,825 and 6,701,305, may identify, or at least attempt to find, paragraphs, entities, and/or subject matter of similar content. The same or a different text data analysis and/or data mining process 10 may be used to determine content differences between paragraphs identified as having similar content, and/or a text data analysis or data mining process may determine content differences between the documents as a whole. For example, TRUST and other text data analysis and/or data mining technologies may be used to identify differences in a collection of documents which are very similar in content. One example for identifying differences in a collection of documents is provided. A user issues a query (usually initiated by entering a set of one or more terms and/or a document, a section of a document, or other database matches) to retrieve a set of matched documents. The user then may use TRUST representation of the returned documents in conjunction with a clustering algorithm (such as the K-Means clustering algorithm) to cluster the returned set of documents to identify groups of documents that are very similar in content. For example, if the query is “Saturn 2005,” the clustering results might reveal that there are three clusters of returned documents: one set about the planet Saturn, another about the Saturn automobile, and the third about a corporate project named Saturn.
If the user's interest is the automobile, the second cluster may be selected for further investigation for differences. Other technologies may be used to accomplish this as well. Alternatively, the user may use “query by example” (i.e., “more like this” selection in some search engines) to find a set of documents that are very similar to one of the returned documents of interest. Other ways of obtaining a set of two or more documents that are highly similar in content may be used. As searching is refined, and document sets are decreased to fewer documents, the content of the documents typically will be more similar. The system may then use TRUST or one or more other technologies to generate keywords for each of the paragraphs in each document in this document set. These may be used to help compare where these documents differ. Even documents with very similar content can have differences. For example, if document 1 and document 2 each have 5 paragraphs, TRUST may identify that paragraphs 1, 2, 3 and 4 in each document are practically identical, but that paragraphs 5 have lower similarity scores, and, thus, may have content differences. However, while the previous example of text differentiation describes a paragraph-by-paragraph operation, other embodiments of the present invention need not operate at a paragraph-by-paragraph level, but may operate, for example, by non-corresponding sections of text or simply at a document level. The system may use one or more entity extraction technologies to identify important entities (e.g., person names, locations, time, company names, etc.) in each document in this set. A keyword or subject matter extraction technology may also be used.
For example, an extraction mode of a text data analysis or data mining process may be used to identify, or mark for result presentation purposes such as at block 16, specific instances of content, such as non-similar content uniquely occurring in a single document (absolute uniqueness of non-similar content) or content not occurring universally throughout all of the documents (non-universal non-similar content). An extraction step may optionally be guided by use of an ontology. For example, a text data analysis or data mining process may be guided by an ontology related to a user's interests, such as derived from a query input in a search engine application. Often there are existing ontologies that users may use and/or modify for use, such as the WordNet of Princeton University, available at http://wordnet.princeton.edu/. Many companies have established similar enterprise-level ontologies as well, such as The Boeing Company's Technical Library's Thesaurus Terms. By way of an example use of an ontology, if the user's interest is the automobile Saturn, the user may select an ontology that only picks out differences in features and price in an automobile, but ignores any information about the spokesman or reviewer mentioned in the articles, i.e., ignores references to people or at least particular people. The system may perform a combination of one or more of the steps above, or other steps, to identify differences in a set of similar documents.
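
The paragraph-by-paragraph comparison described above can be sketched as follows. This sketch does not use TRUST or the subspace representation of the referenced patents; it substitutes a plain term-frequency cosine similarity purely for illustration. The function names, the `threshold` value, and the sample paragraphs are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def term_vector(text: str) -> Counter:
    """Count lowercase alphanumeric tokens in a paragraph."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def flag_paragraphs(
    doc1: List[str], doc2: List[str], threshold: float = 0.8
) -> List[Tuple[int, float]]:
    """Return (index, score) for corresponding paragraphs scoring below threshold."""
    flagged = []
    for index, (p1, p2) in enumerate(zip(doc1, doc2)):
        score = cosine_similarity(term_vector(p1), term_vector(p2))
        if score < threshold:
            flagged.append((index, score))
    return flagged


# Hypothetical two-paragraph documents about the same automobile.
document_1 = ["The sedan has four doors.", "The price is 20000 dollars."]
document_2 = ["The sedan has four doors.", "A sunroof is now standard equipment."]
flagged = flag_paragraphs(document_1, document_2)
```

Paragraphs that score below the threshold are candidates for content differences and could then be passed to an entity or keyword extraction step, optionally filtered by an ontology, as described above.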

The function of an embodiment of the present invention of analyzing documents of similar content to identify non-similar information may be compared to the process of de-dupping, also referred to as de-duping and de-duplicating. De-dupping commonly refers to removing duplicate records in databases, such as removing all but one set of identical or nearly identical documents in a database like a library catalog. De-dupping also refers to removing repeated values from an input vector in mathematics, returning a new vector that has just one copy of each distinct value in the input; avoiding duplicate entries or elevated weightings or counts in document summaries of metadata. Although not an identical task or feature to de-dupping, an embodiment of the present invention may analyze documents of similar content by ignoring content in a document that is similar to content in another document, effectively de-dupping the common content and focusing only on the non-similar content in the documents.
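
A minimal sketch of both ideas follows: de-dupping a vector of values, and the analogous step of setting aside content common to every document so that only non-similar content remains. Preserving first-occurrence order in `dedupe` and representing each document as a set of content items in `drop_common` are assumptions made for illustration; both function names are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Set


def dedupe(values: Iterable[Hashable]) -> List[Hashable]:
    """Return one copy of each distinct value, in first-occurrence order."""
    seen = set()
    result = []
    for value in values:
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


def drop_common(documents: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Remove from every document the items that all documents share."""
    if not documents:
        return {}
    common = set.intersection(*documents.values())
    return {doc_id: items - common for doc_id, items in documents.items()}


unique_values = dedupe([3, 1, 3, 2, 1])
remaining = drop_common({"doc_1": {"x", "y"}, "doc_2": {"x", "z"}})
```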

By way of example, a presentation of results of a text data analysis embodiment for text differentiation of the present invention is provided in FIGS. 2A, 2B, 2C, 2D, and 2E. Each figure represents a section of a news story related to the same event, the announcement of a Boeing and IBM partnership. While the news stories describe the same event, the stories are not identical. Typical news stories about a particular event describe the same or similar facts related to the event in different ways. Many news stories are re-tellings of original news stories about the event. Thus, many news stories on the same event will be very similar documents with similar content. However, the news stories may not be identical, and some of the news stories may include content different from the other news stories, such as factual discrepancies or additional information. Accordingly, text differentiation may be performed to identify the non-similar content in the text of the news stories.

The text differentiation which has been performed on the news stories of FIGS. 2A, 2B, 2C, 2D, and 2E, has highlighted various terms (words, phrases, numbers, etc.), to allow a knowledge worker the ability to quickly review the documents to identify the key terms (underlined or alternatively, for example, in a color such as blue text), secondary terms (bolded and italicized or alternatively, for example, in a color such as green text) such as identified by an ontology or thesaurus in relation to the key terms, and differentiated terms (underlined bolded text or alternatively, for example, in a color such as red text). Result presentations may be in many forms, such as the presentation of highlighted sections of relevant text in FIGS. 2A, 2B, 2C, 2D, and 2E. While results may be presented in various manners, highlighting, such as text coloring, text backgrounds, bolding, underlining, sizing, font, brightness, flashing, moving, etc., may be useful for allowing a knowledge worker the ability to visually identify content differences in multiple documents. Similarly, instead of merely having two extremes, such as identical or opposites, text differentiation may present non-similar content (differentiated content) along a continuum, such as presented by gradations of color or brightness where the brighter the term, the less similar and more different the term is from what may be found in the other documents. Thus, if some content is absolutely unique, it may be presented most brightly; if some other content is next-to-absolutely unique where one or two other documents present the same content, it may be presented slightly less bright; and if some further content is not-universally unique where three or more other documents include the same content, the content may be presented even less brightly.
Similar other approaches may be used to present text differentiation along a continuum of non-similarity, including, but not limited to, using a differentiation scoring methodology, rather than a uniqueness methodology, to determine gradations of contrast, such as where content completely different from content in at least one other document (e.g., $200 million instead of $200 billion) receives a differentiation scoring of 100% correlating to a contrast of 100% so the content is entirely visible, content close but not identical to all other documents (e.g., $1.321 million instead of $1.32 million) receives a differentiation scoring of 75% correlating to a contrast of 75%, and content the same as in all the other documents receives a differentiation scoring of 50% correlating to a contrast of 50%, thereby emphasizing what content remains most visible as the quantitatively more non-similar content, and decreasing non-similarity by visibility, and aiding a knowledge worker to quickly focus on non-similar content. One method of preparing the results of text differentiation is to create a markup file, such as an XML or HTML file, of the analyzed documents or sections of text, but any number of manners of preparing and presenting results of text differentiation may be used.
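
The two presentation schemes above, brightness tiers based on uniqueness and contrast based on a differentiation score, can be sketched as a small HTML markup helper. The function names, the tier labels, and the use of the CSS `opacity` property as a stand-in for contrast are assumptions made for illustration; how the differentiation score itself is computed is outside the scope of this sketch.

```python
from html import escape


def brightness_level(other_documents_sharing: int) -> str:
    """Map the number of other documents sharing the content to a brightness tier."""
    if other_documents_sharing == 0:
        return "brightest"  # absolutely unique content
    if other_documents_sharing <= 2:
        return "bright"  # one or two other documents share the content
    return "dim"  # three or more other documents share the content


def mark_up(term: str, differentiation_score: float) -> str:
    """Wrap a term in an HTML span whose opacity equals the score (0.0 to 1.0)."""
    score = min(max(differentiation_score, 0.0), 1.0)
    return f'<span style="opacity: {score:.2f}">{escape(term)}</span>'


completely_different = mark_up("$200 million", 1.0)
close_but_not_identical = mark_up("$1.321 million", 0.75)
same_as_others = mark_up("partnership", 0.5)
```

Clamping the score keeps the generated markup valid even if a scoring process returns a value outside the expected range.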

In effect, an embodiment of the present invention is capable of identifying non-similar content (differentiated content) between two or more documents or sections of documents having similar content, such as the set of news stories on the same event of FIGS. 2A, 2B, 2C, 2D, and 2E. Typically, the more similar the content of the documents or sections of documents, the more useful and better the results of an embodiment of the present invention, because differences will likely appear more hidden or buried in one or more of the documents, and will easily be identified as a difference that a knowledge worker can quickly focus on for consideration and/or analysis.

To further describe the present invention, an embodiment is described below in a search engine application. FIG. 3 is a flow diagram illustrating logic for performing text differentiation of an embodiment of the present invention in a search engine application. The logic moves from a start block 32 to a search terms entry block 34 representing the initiation of a query by a knowledge worker. A query may include one or more search terms entered in a conventional manner using text input. Alternatively, an embodiment of the present invention may initiate a query by a knowledge worker importing one or more documents, or similarly inputting a portion of at least one document. The document(s), or portion(s) thereof, can be interpreted by a text data analysis and/or data mining process to extract search terms, such that the content of the document(s), or portion(s) thereof, becomes the search terms for the query.

A document collection 36, representing a plurality of text documents, is acquired, selected, known, or otherwise accessible for performing text differentiation. Text differentiation of the present invention involves comparing two or more documents to identify non-similar content between the documents, typically comparing documents with similar content by extracting documents with similar content from a document collection. For example, the document collection 36 may be documents that are searchable using a particular corporate database search, a search engine application on the Internet, or the like.

One or more text data analysis and/or data mining processes 40 are used at block 38 to analyze the documents in the plurality of text documents to identify documents that are related to the search terms of the query. For example, a text data analysis and/or data mining process 40 may attempt to identify documents with high relevancy scores with respect to the search terms. Different conventional data mining processes may be used to analyze the document collection 36. The relevancy analysis of block 38 is provided for the extraction of relevant documents at block 42.

At block 42, one or more text data analysis and/or data mining processes 40 may be used to extract a limited set of documents related to the search terms. Accordingly, text differentiation may be performed on any number of documents from the original document collection 36. Extracting a limited set of documents related to the search terms narrows the focus for the identification of non-similar content. For example, the text differentiation process may be configured to extract only related documents that exceed a predetermined threshold relevancy limit or configured to extract only a predetermined number of documents. If only a few similar documents relate to the search terms of the query, the documents may contain significant amounts of non-similar content. By limiting the extracted documents to highly similar documents, the text differentiation may identify non-similar content in documents that are related to the search terms of the query and are very similar to each other, thereby reducing the amount of non-similar content between the documents. When many documents are very similar, text differentiation of the present invention is particularly useful, because it can identify the non-similar content, thereby allowing the knowledge worker to decide where to focus his or her attention and assisting him or her in assembling and fusing information from multiple documents. An embodiment of the present invention may extract documents related to search terms into subsets of documents of similar content. These subsets of documents can then be analyzed separately for non-similar content.
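
For illustration, the two extraction configurations mentioned above (a predetermined threshold relevancy limit, or a predetermined number of documents) might be sketched as follows. The function name `extract_related`, its parameters, and the assumption that relevancy scores have already been computed by some other process are all hypothetical and are not taken from the described embodiment.

```python
def extract_related(scored_docs, threshold=None, max_docs=None):
    """Return document ids ordered by relevancy, most relevant first.

    scored_docs: list of (doc_id, relevancy) pairs; the scores are assumed
                 to come from a separate relevancy analysis.
    threshold:   if given, keep only documents whose score exceeds it.
    max_docs:    if given, keep at most this many documents.
    """
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    if threshold is not None:
        ranked = [pair for pair in ranked if pair[1] > threshold]
    if max_docs is not None:
        ranked = ranked[:max_docs]
    return [doc_id for doc_id, _ in ranked]


if __name__ == "__main__":
    scored = [("a", 0.9), ("b", 0.4), ("c", 0.75), ("d", 0.8)]
    print(extract_related(scored, threshold=0.7))
    print(extract_related(scored, max_docs=2))
```

Both limits may be combined, in which case the threshold is applied first and the count limit second.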

After extracting a limited set of documents related to the search terms, at block 44 one or more text data analysis and/or data mining processes 40 may be used to analyze the extracted documents for non-similar content. For example, non-similar content may include different keywords, entities (personal, geographic, company names, governments, organizations, etc.), or subject matter (section or paragraph topics) included in one or more documents, but not included in the majority of documents.
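
As a simplified illustration of content "included in one or more documents, but not included in the majority of documents," the following sketch works at the level of individual keywords only. The function name `minority_terms`, the word-level tokenization, and the at-most-half cutoff are assumptions for this sketch; entity extraction and subject matter identification are not shown.

```python
import re
from collections import Counter


def minority_terms(docs):
    """Map each document id to the terms it contains that do not appear
    in a majority of the documents.

    docs: dict mapping a document id to the document text.
    """
    # One set of lower-cased word tokens per document.
    term_sets = {
        doc_id: set(re.findall(r"\w+", text.lower()))
        for doc_id, text in docs.items()
    }
    # Number of documents in which each term occurs.
    doc_freq = Counter(term for terms in term_sets.values() for term in terms)
    half = len(docs) / 2
    return {
        doc_id: {term for term in terms if doc_freq[term] <= half}
        for doc_id, terms in term_sets.items()
    }


if __name__ == "__main__":
    stories = {
        "a": "The merger closed on Monday",
        "b": "The merger closed on Tuesday",
        "c": "The merger closed on Monday in Seattle",
    }
    print(minority_terms(stories))
```

Here "monday" occurs in two of the three documents and is therefore treated as majority content, whereas "tuesday" and "seattle" each occur in only one document.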

At block 46, an embodiment of the present invention highlights differences and/or presents the results of the query. Results of text differentiation can take any number of forms, just as conventional search results are provided in various forms. One typical presentation may present relevant sections of the compared documents with highlighting (such as text coloring, text background coloring, bolding text, etc.) to identify content differences using an HTML or XML markup document. Another presentation may list the extracted documents with abstracts or summaries of the non-similar content for each document provided below a URL link to each document. In this manner, a knowledge worker can scan the list of results for non-similar content to identify documents that might include different or additional content of interest to the knowledge worker. If related documents are extracted into subsets of documents with similar content, the presentation of results may be organized by the subsets of documents.
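
One possible form of the highlighted HTML presentation is sketched below, assuming the non-similar terms have already been identified. The function name `highlight` and the choice of the HTML `<mark>` element are illustrative assumptions; text coloring, background coloring, or bolding could be substituted.

```python
import re
from html import escape


def highlight(text, nonsimilar_terms):
    """Return HTML in which every word found in nonsimilar_terms
    (compared case-insensitively) is wrapped in a <mark> element."""

    def mark(match):
        chunk = match.group(0)
        if chunk.lower() in nonsimilar_terms:
            return f"<mark>{escape(chunk)}</mark>"
        # Words that are not highlighted, and all spacing and punctuation,
        # are only escaped.
        return escape(chunk)

    # Alternate between runs of word characters and runs of everything else,
    # so the whole input is consumed and re-emitted.
    return re.sub(r"\w+|\W+", mark, text)


if __name__ == "__main__":
    print(highlight("The merger closed on Tuesday", {"tuesday"}))
```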

Each block or step of the flowcharts and combinations of blocks or steps in the flowcharts of FIGS. 1 and 3 can be implemented by computer program instructions or other means. Although computer program instructions are discussed below, an apparatus according to the present invention can include other means, such as hardware or some combination of hardware and software, including one or more processors or controllers for performing text differentiation.

In this regard, FIG. 4 depicts the apparatus of one embodiment including several of the key components of a general purpose computer 50 on which the present invention may be implemented. A computer may include many more components than those shown in FIG. 4. However, it is not necessary that all of these generally conventional components be shown in order to disclose an illustrative embodiment for practicing the invention. The computer 50 includes a processing unit 60 and a system memory 62 which includes random access memory (RAM) and read-only memory (ROM). The computer also includes nonvolatile storage 64, such as a hard disk drive, where data is stored. The apparatus of the present invention can also include one or more input devices 68, such as a mouse, keyboard, etc. A display 66 is provided for viewing text mining data, and interacting with a user interface to request text mining operations. The apparatus of the present invention may be connected to one or more remote computers 70 via a network interface 72. The connection may be over a local area network (LAN) or a wide area network (WAN), and includes all of the necessary circuitry for such a connection. In one embodiment of the present invention, the document collection includes documents on an Intranet. Other embodiments are possible, including a local document collection, i.e., all documents on one computer, documents stored on a local or network server, documents stored on a client in a network environment, etc.

Typically, computer program instructions may be loaded onto the computer 50 or other programmable apparatus to produce a machine, such that the instructions which execute on the computer or other programmable apparatus create means for implementing the functions specified in the flowchart block(s) or step(s). These computer program instructions may also be stored in a computer-readable memory, such as system memory 62, that can direct a computer or other programmable apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block(s) or step(s). The computer program instructions may also be loaded onto the computer or other programmable apparatus to cause a series of operational steps to be performed on the computer 50 or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer 50 or other programmable apparatus provide steps for implementing the functions specified in the flowchart block(s) or step(s).

Accordingly, blocks or steps of the flowcharts of FIGS. 1 and 3 support combinations of means for performing the specified functions, combinations of steps for performing the specified functions and program instruction means for performing the specified functions. For example, a data input software tool of a search engine application is an example means for receiving a query including one or more search terms. Similar software tools of applications of embodiments of the present invention are means for performing the specified functions. Each block or step of the flowchart, and combinations of blocks or steps in the flowcharts, can be implemented by special purpose hardware-based computer systems which perform the specified functions or steps, or combinations of special purpose hardware and computer instructions. For example, an input of the present invention may include computer software for interfacing a processing element with a user controlled input device, such as a mouse, keyboard, touch screen display, scanner, etc. An output of the present invention may include the combination of display software, video card hardware, and display hardware. And a processing element may include a controller, such as a central processing unit (CPU) with a printed circuit board or microprocessor, arithmetic logic unit (ALU), and control unit.

The invention should not be limited to the specific disclosed embodiments. Specific terms are used in a generic and descriptive sense only and not for purposes of limitation.

1. A method of document content analysis for identifying non-similar content from a text data collection, wherein the method comprises: receiving two or more text documents defining the text data collection, wherein each of the text documents comprises a plurality of terms, and wherein at least two of the text documents comprise similar content; identifying the non-similar content of the received text documents; and generating results based at least partly on the non-similar content of the received text documents, wherein generating results comprises presenting a display identifying the non-similar content of the text documents.
2. (canceled)
3. A method according to claim 1, wherein the generation of results further comprises presenting at least portions of the two or more text documents, and wherein non-similar content of the text documents is identified by highlighting the non-similar content in the presented portions of the text documents.
4. A method according to claim 1, wherein the identification of non-similar content comprises determining non-similar content along a range of uniqueness based at least in part on the amount of content shared with the other documents, wherein the generation of results comprises marking the non-similar content with a correlation to the range of uniqueness of the non-similar content.
5. A method according to claim 4, wherein the marking of the non-similar content with a correlation to the range of uniqueness of the non-similar content is performed by adjusting the displayed brightness of color of the text of the non-similar content.
6. A method according to claim 1, wherein the identification of non-similar content comprises: finding sections of similar content; determining differences between the sections; and marking the differences, such that the marked differences can be used for generating results.
7. A method according to claim 6, wherein the determination of differences comprises combining the use of an ontology of user interest and an extraction mode selected from the group of: entity extraction, keyword extraction, and subject matter identification, wherein the ontology of user interest is derived at least in part from receiving one or more terms of interest.
8. A method according to claim 1, wherein the identification of non-similar content comprises: determining non-similar content absolutely unique to any one received text document; and determining non-similar content not universal to all other received text documents.
9. A method according to claim 1, wherein the receiving of text documents comprises obtaining a limited set of text documents from one of the document selection modes selected from the group of: manual selection, search engine results, database search results, document clustering, news story sequencing, and news story source compilations.
10. A method according to claim 9, wherein obtaining a limited set of text documents with similar content comprises refining the results of a search engine query.
11. A method according to claim 1, further comprising receiving a query including one or more search terms; analyzing a plurality of text documents for documents related to the search terms; and extracting two or more text documents related to the search terms for being received, wherein the extracted text documents include similar content.
12. A method according to claim 11, wherein: the extraction of related text documents comprises extracting subsets of related text documents with similar content; the identification of non-similar content comprises identifying non-similar content of each extracted subset of related text documents with similar content; and the generation of results further comprises ordering the query results by the subsets of text documents of similar content.
13. A method according to claim 11, wherein the receipt of the query comprises importing at least a portion of at least one document, wherein the contents of the document are the search terms, and wherein the analysis of the plurality of text documents comprises identifying text documents similar in content to the imported document.
14. A method according to claim 13, wherein the extraction of related text documents comprises extracting the set of text documents representing the documents identified as similar in content to the imported document.
15. A method according to claim 11, wherein the extraction of related text documents comprises discarding all but a predetermined number of text documents with the highest computed relevancy score with respect to the search terms.
16. A computer program product for performing document content analysis to identify non-similar content from a text data collection, the computer program product comprising a computer-readable storage medium having a computer program stored therein, the computer program comprising: first computer program code for receiving two or more text documents defining the text data collection wherein each of the text documents comprises a plurality of terms and wherein at least two of the text documents comprise similar content; second computer program code for identifying the non-similar content of the received text documents; third computer program code for generating results based at least partly on the non-similar content of the received text documents wherein said third computer program code further comprises computer program code for presenting a display identifying the non-similar content of the received text documents.
17. A computer program product according to claim 16, wherein the second computer program code further comprises computer program code for finding sections of similar content in the text documents, determining differences between the sections of similar content in the documents, and marking the differences, such that the marked differences are capable of being used by the third computer program code.
18. A computer program product according to claim 16, wherein the second computer program code further comprises computer program code for determining non-similar content absolutely unique to any one received text document and determining non-similar content not universal to all other received text documents.
19. A computer program product according to claim 16, further comprising fourth computer program code for receiving a query including one or more search terms, wherein the second computer program code identifies non-similar content at least in part based on the search terms.
20. A computer program product according to claim 19, further comprising: fifth computer program code for analyzing the plurality of text documents for documents related to the search terms; and sixth computer program code for extracting two or more text documents from the plurality of text documents, the extracted text documents determined to be related to the search terms by the fifth computer program code, wherein the extracted text documents are capable of being received by the first computer program code.
21. (canceled)
22. A computer program product according to claim 20, wherein the sixth computer program code further comprises computer program code for computing a relevancy score for documents based at least in part on the relationship of the content of the document to the search terms, and wherein the computer program code for extracting text documents limits the number of text documents extracted based at least in part on the computed relevancy scores of the documents.
23. A computer program product according to claim 19, wherein the third computer program code further comprises computer program code for presenting results of the query based at least partly on the non-similar content of the extracted text documents.
24. An apparatus for performing document content analysis to identify non-similar content from a text data collection, the apparatus comprising: an input capable of receiving two or more text documents defining the text data collection wherein each of the text documents comprises a plurality of terms, and wherein at least two of the text documents comprise similar content; a processing element capable of analyzing the received documents to identify non-similar content of the received text documents and capable of generating results based at least partly on the non-similar content of the received text documents, wherein generating results comprises processing data for the presentation of a display identifying the non-similar content of the received text documents; and an output capable of receiving the results.
25. An apparatus according to claim 24, wherein the output comprises a display monitor capable of displaying an identification of the non-similar content of the received text documents.
26. An apparatus according to claim 24, wherein: the input is further capable of receiving a query including one or more search terms; and the processing element is further capable of analyzing the plurality of text documents for documents related to the search terms, extracting text documents related to the search terms, and presenting the extracted text documents to the input.
27. An apparatus according to claim 26, wherein the output is further capable of presenting results of the query based at least partly on the non-similar content of the extracted text documents.
28. An apparatus according to claim 26, wherein the processing element is further capable of extracting subsets of related text documents with similar content and presenting the extracted text documents to the input.
29. An apparatus according to claim 28, wherein the output is further capable of presenting results of the query based at least partly on the non-similar content of the extracted text documents, wherein presentation of results is ordered by the subsets of text documents.
30. An apparatus according to claim 26, wherein the input is further capable of importing at least a portion of at least one document, wherein the contents of the document are the search terms, and wherein the processing element is further capable of identifying text documents similar in content to the imported document.
31. An apparatus according to claim 26, wherein the processing element is further capable of excluding all but a predetermined number of text documents with the highest computed relevancy score with respect to the search terms for extraction.